Welcome to the JupyterLab exercise, where you run your first analysis of software data the data science way!
Technology choices differ. There may be objective reasons for choosing a technology at a specific time, but those reasons often change. A deep attachment to a now outdated technology can block any progress. Objective reasons can thus turn into subjective ones, which can create a toxic environment when technology updates are discussed.
You are a new team member at a software company. The developers there use a version control system ("VCS" for short) called CVS (Concurrent Versions System). Some want to migrate to a better VCS and prefer one called SVN (Subversion). You are young but not inexperienced, and you have heard about a newer version control system named "Git". So you propose Git to the team as an alternative. They are very sceptical about your suggestion. Find evidence that the software development community is mainly adopting the Git version control system!
A dataset from the online software developer community Stack Overflow is available in ../datasets/stackoverflow_vcs_data_subset.gz with the following data:
CreationDate: the timestamp of the creation date of a Stack Overflow post (= question)
TagName: the tag name for a technology (in our case for only 4 VCSes: "cvs", "svn", "git" and "mercurial")
ViewCount: the number of views of a post
These are the first 10 lines of this dataset (the header plus nine entries):
CreationDate,TagName,ViewCount
2008-08-01 13:56:33,svn,10880
2008-08-01 14:41:24,svn,55075
2008-08-01 15:22:29,svn,15144
2008-08-01 18:00:13,svn,8010
2008-08-01 18:33:08,svn,92006
2008-08-01 23:29:32,svn,2444
2008-08-03 22:38:29,svn,871830
2008-08-03 22:38:29,git,871830
2008-08-04 11:37:24,svn,17969
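To get a feel for this format before touching the real file, the rows shown above can be loaded straight from a string. This is only a minimal sketch, assuming pandas is installed; the names `sample_csv`, `sample` and `multi_tagged` are hypothetical and used only for illustration. It also highlights one detail visible in the excerpt: two rows share the same timestamp and view count, which presumably means that a post carrying several tags appears once per tag.

```python
import io

import pandas as pd

# The sample rows from above, inlined so the sketch runs without the dataset file.
sample_csv = """CreationDate,TagName,ViewCount
2008-08-01 13:56:33,svn,10880
2008-08-01 14:41:24,svn,55075
2008-08-01 15:22:29,svn,15144
2008-08-01 18:00:13,svn,8010
2008-08-01 18:33:08,svn,92006
2008-08-01 23:29:32,svn,2444
2008-08-03 22:38:29,svn,871830
2008-08-03 22:38:29,git,871830
2008-08-04 11:37:24,svn,17969
"""

# parse_dates converts the CreationDate column to timestamps while reading.
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["CreationDate"])

# Rows that share a timestamp and a view count (assumption: one post, several tags).
multi_tagged = sample[
    sample.duplicated(subset=["CreationDate", "ViewCount"], keep=False)
]

print(sample.dtypes)
print(multi_tagged)
```

Because such a post contributes its views to every tag it carries, the per-tag totals are not mutually exclusive, which is worth keeping in mind when comparing the VCSes.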
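In [ ]:
import pandas as pd

# Load the compressed CSV; pandas infers gzip compression from the .gz extension.
vcs_data = pd.read_csv('../datasets/stackoverflow_vcs_data_subset.gz')
vcs_data.head()
In [ ]:
# Convert the timestamp strings to datetime values so they can be resampled later.
vcs_data['CreationDate'] = pd.to_datetime(vcs_data['CreationDate'])
vcs_data.head()
In [ ]:
# Sum the views for each combination of timestamp and tag.
number_of_views = vcs_data.groupby(['CreationDate', 'TagName']).sum()
number_of_views.head()
In [ ]:
# Move the tags into columns: one view-count column per VCS.
views_per_vcs = number_of_views.unstack()['ViewCount']
views_per_vcs.head()
In [ ]:
# Sum the views per month, then accumulate them over time.
monthly_views = views_per_vcs.resample("1M").sum().cumsum()
monthly_views.head()
In [ ]:
%matplotlib inline
monthly_views.plot(title="Monthly Stack Overflow post views (cumulative)");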
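The reshaping steps in the cells above (group by time and tag, move the tags into columns, aggregate per period, accumulate) can also be tried on the nine sample rows alone. The following is a sketch under stated assumptions, not the notebook's own solution: `cumulative_views` is a hypothetical helper, and it buckets timestamps with `dt.to_period` instead of `resample`, so the result is indexed by periods rather than by month-end timestamps. Since all sample rows fall into August 2008, the sketch uses daily periods to produce more than one row.

```python
import io

import pandas as pd

# The sample rows from the dataset excerpt, inlined so no file is needed.
sample_csv = """CreationDate,TagName,ViewCount
2008-08-01 13:56:33,svn,10880
2008-08-01 14:41:24,svn,55075
2008-08-01 15:22:29,svn,15144
2008-08-01 18:00:13,svn,8010
2008-08-01 18:33:08,svn,92006
2008-08-01 23:29:32,svn,2444
2008-08-03 22:38:29,svn,871830
2008-08-03 22:38:29,git,871830
2008-08-04 11:37:24,svn,17969
"""
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["CreationDate"])


def cumulative_views(posts: pd.DataFrame, freq: str = "M") -> pd.DataFrame:
    """Cumulative views: one row per period, one column per tag.

    Hypothetical helper for illustration; `freq` is a pandas period alias
    such as "M" (month) or "D" (day).
    """
    # Bucket every timestamp into its period, e.g. 2008-08 for freq="M".
    periods = posts["CreationDate"].dt.to_period(freq)
    # Sum the views per period and tag, then spread the tags over columns.
    per_period = (
        posts.groupby([periods, "TagName"])["ViewCount"]
        .sum()
        .unstack(fill_value=0)
    )
    # Accumulate over time so each row holds the running total.
    return per_period.cumsum()


daily = cumulative_views(sample, freq="D")
monthly = cumulative_views(sample)
print(daily)
print(monthly)
```

On the full dataset, calling the helper with the default monthly frequency should give a table comparable to the one plotted above, apart from the different index type.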